{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7ba907fe",
   "metadata": {},
   "source": [
    "# Notebook 1: Synthetic Data Generation for RAG Evaluation\n",
    "This notebook demonstrates how to use LLMs to generate question-answer pairs on a knowledge dataset using LLMs.\n",
    "We will use the dataset of pdf files containing the NVIDIA blogs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2ab13af",
   "metadata": {},
   "source": [
    "![synthetic_data](imgs/synthetic_data_pipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64a04c5d",
   "metadata": {},
   "source": [
    "## Step 1: Load the PDF Data\n",
    "\n",
    "[LangChain](https://python.langchain.com/docs/get_started/introduction) library provides document loader functionalities that handle several data format (HTML, PDF, code) from different sources and locations (private s3 buckets, public websites, etc).\n",
    "\n",
    "LangChain Document loaders  provide a ``load`` method and output a piece of text (`page_content`) and associated metadata. Learn more about LangChain Document loaders [here](https://python.langchain.com/docs/integrations/document_loaders).\n",
    "\n",
    "In this notebook, we will use a LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a pdf of NVIDIA blog post."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e4b78ca",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# %%capture\n",
    "# !unzip dataset.zip\n",
    "# !pip install -r requirements.txt\n",
    "# !pip install --upgrade langchain\n",
    "!!pip install docx2txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe7094b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# take a pdf sample\n",
    "pdf_example='../ORAN_kb/O-RAN.SFG.Non-RT-RIC-Security-TR-v01.00.pdf'\n",
    "DOCS_DIR = \"../ORAN_kb/\"\n",
    "# # visualize the pdf sample\n",
    "# from IPython.display import IFrame\n",
    "# IFrame(pdf_example, width=900, height=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74912856",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "import os\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain_community.document_loaders import DirectoryLoader\n",
    "# from langchain.document_loaders import  DirectoryLoader\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain_community.document_loaders import UnstructuredHTMLLoader\n",
    "from langchain_community.document_loaders import UnstructuredPDFLoader\n",
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "from langchain.vectorstores import FAISS\n",
    "import pickle\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from langchain.document_loaders import UnstructuredFileLoader, Docx2txtLoader\n",
    "import re\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13c491b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the pdf sample\n",
    "loader = UnstructuredFileLoader(pdf_example)\n",
    "data = loader.load()\n",
    "DOCS_DIR = \"../ORAN_kb\"\n",
    "# Load raw documents from the directory\n",
    "text_loader_kwargs={'autodetect_encoding': True} #loader_kwargs=text_loader_kwargs\n",
    "raw_txts = DirectoryLoader(DOCS_DIR, glob=\"**/*.txt\", show_progress=True, loader_cls=TextLoader).load()\n",
    "raw_htmls = DirectoryLoader(DOCS_DIR, glob=\"**/*.html\", show_progress=True, loader_cls=UnstructuredHTMLLoader).load()\n",
    "raw_pdfs = DirectoryLoader(DOCS_DIR, glob=\"**/*.pdf\", show_progress=True, loader_cls=UnstructuredPDFLoader).load()\n",
    "raw_docs = DirectoryLoader(DOCS_DIR, glob=\"**/*.docx\", show_progress=True, loader_cls=Docx2txtLoader).load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62593797-4d8f-499c-a088-520dab8a9404",
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_pdfs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8f2c5a0",
   "metadata": {},
   "source": [
    "## Step 2: Transform the Data \n",
    "\n",
    "The goal of this step is tp break large documents into smaller **chunks**. \n",
    "\n",
    "LangChain library provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as `text splitters`. In this example, we will use the generic [``RecursiveCharacterTextSplitter``](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter), we will set the chunk size to 3K and overlap to 100. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cbd9454-399b-44de-a1db-16dbded92f2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_line_break(text):\n",
    "    text = text.replace(\"\\n\", \" \").strip()\n",
    "    text = re.sub(\"\\.\\.+\", \"\", text)\n",
    "    text = re.sub(\" +\", \" \", text)\n",
    "    return text\n",
    "def remove_two_points(text):\n",
    "    text = text.replace(\"..\",\"\")\n",
    "    return text\n",
    "def remove_two_slashes(text):\n",
    "    text = text.replace(\"__\",\"\")\n",
    "    return text\n",
    "\n",
    "def just_letters(text):\n",
    "    return re.sub(r\"[^a-z]+\", \"\", text).strip()\n",
    "\n",
    "def remove_non_english_letters(text):\n",
    "    return re.sub(r\"[^\\x00-\\x7F]+\", \"\", text)\n",
    "\n",
    "\n",
    "def langchain_length_function(text):\n",
    "    return len(just_letters(remove_line_break(text)))\n",
    "\n",
    "\n",
    "def word_count(text):\n",
    "    text = remove_line_break(text)\n",
    "    text = just_letters(text)\n",
    "    tokenizer = RegexpTokenizer(r\"[A-Za-z]\\w+\")\n",
    "    tokenized_text = tokenizer.tokenize(text)\n",
    "    tokenized_text_len = len(tokenized_text)\n",
    "    return tokenized_text_len\n",
    "\n",
    "\n",
    "def truncate(text, max_word_count=1530):\n",
    "    tokenizer = RegexpTokenizer(r\"[A-Za-z]\\w+\")\n",
    "    tokenized_text = tokenizer.tokenize(text)\n",
    "    return \" \".join(tokenized_text[:max_word_count])\n",
    "\n",
    "\n",
    "def strip_non_ascii(string):\n",
    "    ''' Returns the string without non ASCII characters'''\n",
    "    stripped = (c for c in string if 0 < ord(c) < 127)\n",
    "    return ''.join(stripped)\n",
    "\n",
    "\n",
    "def generate_questions(chunk, triton_client):\n",
    "    triton_client.predict_streaming(chunk)\n",
    "    chat_history = \"\"\n",
    "    response_streaming_list = []\n",
    "    while True:\n",
    "        try:\n",
    "            response_streaming = triton_client.user_data._completed_requests.get(block=True)\n",
    "            \n",
    "        except Exception:\n",
    "            triton_client.close_streaming()\n",
    "            break\n",
    "    \n",
    "        if type(response_streaming) == InferenceServerException:\n",
    "            print(\"err\")\n",
    "            triton_client.close_streaming()\n",
    "            break\n",
    "    \n",
    "        if response_streaming is None:\n",
    "            triton_client.close_streaming()\n",
    "            break\n",
    "    \n",
    "        else:\n",
    "            response_streaming_list.append(response_streaming)\n",
    "            chat_history = triton_client.prepare_outputs(response_streaming_list)\n",
    "            yield chat_history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11a31f34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from langchain.text_splitter import  RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9377f2e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# instantiate the RecursiveCharacterTextSplitter\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100, length_function=langchain_length_function, )\n",
    "\n",
    "all_documents=raw_pdfs+raw_docs\n",
    "\n",
    "# split all the docs\n",
    "documents = text_splitter.split_documents(all_documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a5118e0",
   "metadata": {},
   "source": [
    "Let's check the number of chunks of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05219921",
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the number of chunks\n",
    "len(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3982afd-7be6-4126-baa3-d5deb3d2828a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#remove short chuncks\n",
    "filtered_documents = [item for item in documents if len(item.page_content) >= 200]\n",
    "[(len(item.page_content),item.page_content) for item in documents]\n",
    "documents = filtered_documents\n",
    "pd.DataFrame([doc.metadata for doc in documents])['source'].unique()\n",
    "#remove line break\n",
    "for i in range(0,len(documents)-1):\n",
    "    documents[i].page_content=remove_line_break(documents[i].page_content)\n",
    "#remove two points\n",
    "for i in range(0,len(documents)-1):\n",
    "    documents[i].page_content=remove_two_points(documents[i].page_content)\n",
    "#remove non english characters points\n",
    "for i in range(0,len(documents)-1):\n",
    "    documents[i].page_content=remove_two_slashes(documents[i].page_content)\n",
    "#remove two points\n",
    "for i in range(0,len(documents)-1):\n",
    "    documents[i].page_content=remove_two_points(documents[i].page_content)\n",
    "[(len(item.page_content),item.page_content) for item in documents]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6713424",
   "metadata": {},
   "source": [
    "Let's check the first chunk of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b04d167a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the first chunks\n",
    "len(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52da0624",
   "metadata": {},
   "source": [
    "## Step 3: Generate Question-Answer Pairs\n",
    "\n",
    "\n",
    "**Instruction prompt:**\n",
    "```\n",
    "Given the previous paragraph, create one very good question answer pair.\n",
    "Your output should be in a json format of individual question answer pairs.\n",
    "Restrict the question to the context information provided.\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e58a88c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_splits = documents\n",
    "# set the instruction_prompt\n",
    "instruction_prompt = \"Given the previous paragraph, create one very good question answer pair. Your output should be in a json format of individual question answer pairs. Restrict the question to the context information provided.\"\n",
    "\n",
    "# set the context prompt\n",
    "context = '\\n'.join([all_splits[0].page_content, instruction_prompt])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9706fe7d",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# check the prompt\n",
    "print(context)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "507bc8ef",
   "metadata": {},
   "source": [
    "#### a) AI Playground LLM generator\n",
    "\n",
    "**NVIDIA AI Playground** on NGC allows developers to experience state of the art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT nd Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. Sign up process is easy. follow the steps <a href=\"https://github.com/NVIDIA/GenerativeAIExamples/blob/main/docs/rag/aiplayground.md\">here</a>. \n",
    "\n",
    "We are going to use theAI playground'ss `llama2-70B `LLM to generate the Question-Answer pairs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b9d911c",
   "metadata": {},
   "source": [
    "Let's now use the AI Playground's langchain connector to generate the question-answer pair from the previous context prompt (document chunk + instruction prompt). Populate your API key in the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5463fd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['NVIDIA_API_KEY'] = \"\"\n",
    "os.environ['NVAPI_KEY'] = \"\"\n",
    "from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings\n",
    "import json\n",
    "# # make sure to export your NVIDIA AI Playground key as NVIDIA_API_KEY!\n",
    "llm = ChatNVIDIA(model=\"playground_llama2_70b\")\n",
    "print(list(llm.available_models))\n",
    "an = llm.invoke(context)\n",
    "json.loads(an.content)['answer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8475a773",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import multiprocessing\n",
    "all_contexts = []\n",
    "all_answers = []\n",
    "def llm_ans(args):\n",
    "    count, split = args\n",
    "    context = '\\n'.join([split.page_content, instruction_prompt])\n",
    "    filename = split.metadata['source']\n",
    "    #check output\n",
    "    answer = llm.invoke(context)\n",
    "#     all_contexts.insert(count,context)\n",
    "#     all_answers.insert(count,answer)\n",
    "    print(count)\n",
    "    return [context, filename, answer]\n",
    "\n",
    "n = multiprocessing.cpu_count() \n",
    "print(\"Number of cores: \",n)\n",
    "pool = multiprocessing.Pool(12)\n",
    "args = [(count,split) for count,split in enumerate(all_splits[0::300])]\n",
    "print(len(args))\n",
    "results = pool.map(llm_ans, args)\n",
    "pool.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9284302",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "context,filename,answer = results[0]\n",
    "answer = json.loads(an.content) #['answer']json.loads(answer)\n",
    "answer['answer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34cd8c22",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "data = []\n",
    "for i in results:\n",
    "    context,filename,answer = i\n",
    "    print(answer)\n",
    "    try:\n",
    "        answer = json.loads(answer.content)\n",
    "    except:\n",
    "        continue\n",
    "    if isinstance(answer,list):\n",
    "     answer = answer[0]\n",
    "#     print(answer)\n",
    "    data.append({'gt_context': context,'document': filename,'question': answer['question'],'gt_answer': answer['answer']})\n",
    "#     print(data)\n",
    "with open('syn_data_oran.json', 'w') as f:\n",
    "    json.dump(data, f, ensure_ascii=False, indent=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5a92152",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2321af0b",
   "metadata": {},
   "source": [
    "# End-to-End Synthetic Data Generation\n",
    "\n",
    "We have run the above steps and on 600 pdfs of NVIDIA blogs dataset and saved the data in json format below. Where gt_context is the ground truth context and gt_answer is ground truth answer.\n",
    "\n",
    "```\n",
    "{\n",
    "'gt_context': chunk,\n",
    "'document': filename,\n",
    "'question': \"xxxx\",\n",
    "'gt_answer': \"xxxx\"\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3f6a49f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "with open(\"syn_data_oran.json\") as f:\n",
    "    dataset = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e6750c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset[50])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "318c588f",
   "metadata": {},
   "source": [
    "# Synthetic Data Post-processing \n",
    "\n",
    "So far, the generated JSON file structure embeds `gt_context`, `document` and the `question`, `answer` pair.\n",
    "\n",
    "In order to evaluate Retrieval Augmented Generation (RAG) systems, we need to add the RAG results fields (To be populated in the next notebook):\n",
    "   - `contexts`: Retrieved documents by the retriever \n",
    "   - `answer`: Generated answer\n",
    "\n",
    "The new dataset JSON format should be: \n",
    "\n",
    "```\n",
    "{\n",
    "'gt_context': chunk,\n",
    "'document': filename,\n",
    "'question': \"xxxxx\",\n",
    "'gt_answer': \"xxx xxx xxxx\",\n",
    "'contexts':\n",
    "'answer':\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
